A Phonemic Corpus of Polish Child-Directed Speech

Authors

  • Luc Boruta
  • Justyna Jastrzebska

Abstract

Recent advances in modeling early language acquisition are due not only to the development of machine-learning techniques, but also to the increasing availability of data on child language and child-adult interaction. In the absence of recordings of child-directed speech, or when models explicitly require such a representation for training data, phonemic transcriptions are commonly used as input data. We present a novel (and to our knowledge, the first) phonemic corpus of Polish child-directed speech. It is derived from the Weist corpus of Polish, freely available from the seminal CHILDES database. For the sake of reproducibility, and to exemplify the typical trade-off between ecological validity and sample size, we report all preprocessing operations and transcription guidelines. Contributed linguistic resources include updated CHAT-formatted transcripts with phonemic transcriptions in a novel phonology tier, as well as by-product data, such as a phonemic lexicon of Polish. All resources are distributed under the LGPL-LR license.
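
As a rough illustration of how CHAT-formatted transcripts with a phonology tier might be consumed, the sketch below pairs each utterance with its dependent tiers. It rests on several assumptions that the abstract does not confirm: the tier is taken to use the standard CHAT `%pho` code, the speaker codes `MOT` and `CHI` are hypothetical, and the phonemic content in the sample is a placeholder rather than real corpus data.

```python
import re
from typing import Dict, List

# In CHAT, main tiers start with "*" plus a speaker code and dependent
# tiers start with "%" plus a tier code; both are followed by ":" and a tab.
TIER_RE = re.compile(r"^([*%])(\w+):\t(.*)$")


def read_utterances(chat_text: str) -> List[Dict[str, str]]:
    """Group each main-tier utterance with the dependent tiers that follow it."""
    utterances: List[Dict[str, str]] = []
    last_key = None
    for line in chat_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip blank lines and "@" header lines
        match = TIER_RE.match(line)
        if match:
            marker, code, content = match.groups()
            if marker == "*":
                utterances.append({"speaker": code, "text": content})
                last_key = "text"
            elif utterances:
                utterances[-1][code] = content
                last_key = code
        elif line.startswith("\t") and utterances and last_key:
            # A line starting with a tab continues the previous tier.
            utterances[-1][last_key] += " " + line.strip()
    return utterances


def child_directed(utterances: List[Dict[str, str]],
                   child_code: str = "CHI") -> List[Dict[str, str]]:
    """Keep utterances not produced by the child (assumed speaker code)."""
    return [u for u in utterances if u["speaker"] != child_code]


# Hypothetical sample: the %pho content is a placeholder, not corpus data.
SAMPLE = (
    "@Begin\n"
    "*MOT:\tco to jest ?\n"
    "%pho:\t<phonemic transcription>\n"
    "*CHI:\tlala .\n"
    "@End\n"
)

utterances = read_utterances(SAMPLE)
adult_utterances = child_directed(utterances)
```

Keeping each utterance as a plain dictionary keyed by tier code means additional dependent tiers can be read without changing the parser; only the lookup key (here `"pho"`) would need to match whatever tier name the distributed transcripts actually use.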

Similar resources

Statistical analysis of orthographic and phonemic language corpus for word-based and phoneme-based Polish language modelling

This article presents the original results of Polish language statistical analysis, based on the orthographic and phonemic language corpus. Phonemic language corpus for Polish was developed by using automatic grapheme-to-phoneme conversion of the source orthographic language corpus, obtained from the National Corpus of Polish (NCP). The corpus contains the most frequently used Polish words, wri...

Harmonic cues for speech segmentation: a cross-linguistic corpus study on child-directed speech.

Previous studies on the role of vowel harmony in word segmentation are based on artificial languages where harmonic cues reliably signal word boundaries. In this corpus study run on the data available at CHILDES, we investigated whether natural languages provide a learner with reliable segmentation cues similar to the ones created artificially. We observed that in harmonic languages (child-dire...

Allophone-based acoustic modeling for Persian phoneme recognition

Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation, which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighboring phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...

A corpus of European Portuguese child and child-directed speech

We present a corpus of child and child-directed speech of European Portuguese. This corpus results from the expansion of an already existing database (Santos, 2006). It includes around 52 hours of child-adult interaction and now contains 27,595 child utterances and 70,736 adult utterances. The corpus was transcribed according to the CHILDES system (Child Language Data Exchange System) and using...

Automatic Segmentation of Greek Speech Signals to Broad Phonemic Classes

In this paper, we evaluate an implicit approach for the automatic detection of broad phonemic class boundaries in continuous speech signals. The reported method consists of a prior segmentation of the speech signal into pitch-synchronous segments, using pitchmark locations, for the computation of adjacent broad phonemic class boundaries. The approach’s validity was tested on a phonetically ri...

Publication date: 2012